chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 by andygrove · Pull Request #4476 · apache/datafusion-comet

andygrove · 2026-05-27T23:23:27Z

Which issue does this PR close?

Closes #.

Rationale for this change

Continuation of the per-category expression audit. Same pattern as #4475 (conditional), #4474 (misc), #4473 (collection), #4470 (json), #4469 (struct), using the updated audit-comet-expression skill in #4468.

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to crc32, hash, md5, sha, sha1, sha2, and xxhash64. sha is a registry alias of Sha1. Spark 4.0 only adds the DefaultStringProducingExpression trait and the nullIntolerant: Boolean field refactor on the four String-producing expressions (Md5, Sha1, Sha2, Crc32); no runtime behaviour change across the category.

Support-level consistency fixes (in `hash.scala`)

- Refactor HashUtils to return reasons (unsupportedReasonFor, supportLevelForChildren, unsupportedReasons) instead of calling withInfo from inside the helper. The recursive type check no longer has side effects on the expression tree at type-check time, which the audit skill calls out as the canonical antipattern.
- CometXxHash64, CometMurmur3Hash, CometSha1, CometSha2: override getSupportLevel and getUnsupportedReasons so that the unsupported-child-type restriction and (for Sha2) the non-foldable-numBits restriction reach both the dispatcher's EXPLAIN message and the compatibility doc generator.
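
The shape of this refactor can be sketched without Spark or Comet on the classpath. Everything below is a simplified stand-in for illustration: the `DataType` hierarchy, the `SupportLevel` types, and the reason strings are assumptions, not the actual Comet definitions. Only the helper names come from this PR, and apart from `unsupportedReasonFor` their signatures here are guesses.

```scala
// Hypothetical stand-ins for Spark's type hierarchy (not the real classes).
sealed trait DataType
case object IntegerType extends DataType
case object TimeType extends DataType
final case class DecimalType(precision: Int, scale: Int) extends DataType
final case class StructType(fields: Seq[DataType]) extends DataType

// Hypothetical support levels, simplified for this sketch.
sealed trait SupportLevel
case object Compatible extends SupportLevel
final case class Unsupported(reasons: Seq[String]) extends SupportLevel

object HashUtilsSketch {
  // Pure function: reports why a type is unsupported and mutates nothing.
  def unsupportedReasonFor(dt: DataType): Option[String] = dt match {
    case DecimalType(p, _) if p > 18 => Some(s"decimal precision $p > 18")
    case TimeType => Some("TimeType is not supported")
    case StructType(fields) =>
      // Recurse into the fields and keep the first reason found, if any.
      fields.flatMap(f => unsupportedReasonFor(f).toList).headOption
    case _ => None
  }

  // Collects the distinct reasons across all child types.
  def unsupportedReasons(children: Seq[DataType]): Seq[String] =
    children.flatMap(c => unsupportedReasonFor(c).toList).distinct

  // The caller decides what to do with the reasons; no tagging happens here.
  def supportLevelForChildren(children: Seq[DataType]): SupportLevel = {
    val reasons = unsupportedReasons(children)
    if (reasons.isEmpty) Compatible else Unsupported(reasons)
  }
}
```

Because the helper only returns values, the same result can feed both an EXPLAIN message and a documentation generator without the type check leaving marks on the expression tree.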

Tracking issues filed for follow-up

None. The TimeType gap (Spark 4.0+) is covered by the existing #4418 EPIC; the DecimalType gap for precision > 18 is a documented semantic difference (Spark hashes via Java BigDecimal), already declared by the new HashUtils.unsupportedReasons.
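
As background on where the threshold of 18 comes from, here is a small arithmetic sketch. It assumes the relevant property is whether a decimal's unscaled value always fits in a 64-bit Long. The object and method names are hypothetical, and the snippet demonstrates only the arithmetic, not Spark's or Comet's hashing code.

```scala
object DecimalPrecisionSketch {
  // Largest unscaled value with `digits` decimal digits, e.g. 999 for 3.
  def maxUnscaled(digits: Int): BigInt = BigInt(10).pow(digits) - 1

  // True when every unscaled value of that many digits fits in a Long.
  def fitsInLong(digits: Int): Boolean = maxUnscaled(digits).isValidLong
}
```

With these definitions, `fitsInLong(18)` is true and `fitsInLong(19)` is false, since Long.MaxValue is roughly 9.22 × 10^18.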

Audit process

Audited directly using the audit-comet-expression skill (4 Spark versions per #4468). Four serde objects plus the shared HashUtils helper.

How are these changes tested?

./mvnw test -Dsuites="org.apache.comet.CometHashExpressionSuite" -Dtest=none (37 tests pass)
make core succeeds with the serde refactor.

…, 4.1.1

Add per-version audit sub-bullets to `crc32`, `hash`, `md5`, `sha`, `sha1`, `sha2`, and `xxhash64` in `docs/source/contributor-guide/spark_expressions_support.md`. `sha` is a registry alias of `Sha1`. Spark 4.0 only adds the `DefaultStringProducingExpression` trait and the `nullIntolerant` field refactor across this category; no runtime behaviour change.

Apply support-level consistency fixes surfaced by the audit:

- Refactor `HashUtils` to return reasons (`unsupportedReasonFor`, `supportLevelForChildren`, `unsupportedReasons`) instead of calling `withInfo` from inside the helper. The recursive type check no longer side-effects on the expression tree at type-check time.
- `CometXxHash64`, `CometMurmur3Hash`, `CometSha1`, `CometSha2`: override `getSupportLevel` and `getUnsupportedReasons` so the unsupported-child-type and (for Sha2) the non-foldable-numBits restrictions reach the dispatcher and the compatibility doc.

No correctness divergences were found, so no new tracking issues are filed. The known `TimeType` gap (Spark 4.0+) is covered by the existing apache#4418 EPIC; the `DecimalType`-precision-18 gap is a documented Spark semantic difference (BigDecimal hashing).

andygrove · 2026-05-28T22:56:13Z

No deferred follow-up work from this audit. The two known limitations (TimeType for Spark 4.0+, DecimalType precision > 18) are already declared via HashUtils.unsupportedReasons and flip the support level to Unsupported per skill rule 12; TimeType is tracked by the existing #4288 EPIC. The Sha2 non-foldable numBits restriction is now modeled in getSupportLevel rather than convert-time withInfo, satisfying skill rule 10.

kazuyukitanimura · 2026-05-29T21:55:52Z

+  private def unsupportedReasonFor(dt: DataType): Option[String] = dt match {
+    case d: DecimalType if d.precision > 18 => Some(unsupportedDecimalReason)
+    case s: StructType =>
+      s.fields.iterator.flatMap(f => unsupportedReasonFor(f.dataType).iterator).toSeq.headOption


If unsupportedReasonFor(f.dataType).iterator returns None for the first element of fields, will s.fields.iterator.flatMap(f => unsupportedReasonFor(f.dataType).iterator).toSeq.headOption return None?

Do we need to make sure all fields return None?

This returns the first Some item, or None if they are all None.
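
The behaviour described in this answer can be checked in isolation. The sketch below is a minimal, self-contained illustration: `firstReason` and the sample strings are hypothetical, and each `Option[String]` stands in for the result of `unsupportedReasonFor` on one struct field.

```scala
object FirstReasonDemo {
  // Same combinator chain as in the diff, applied to precomputed
  // per-field results instead of real struct fields.
  def firstReason(perField: Seq[Option[String]]): Option[String] =
    perField.iterator.flatMap(r => r.iterator).toSeq.headOption

  // First field is fine (None), second is not: the first Some is returned.
  val mixed: Option[String] = firstReason(Seq(None, Some("b"), Some("c")))

  // Every field is fine: flatMap yields nothing, so headOption is None.
  val clean: Option[String] = firstReason(Seq(None, None, None))
}
```

Here `mixed` is `Some("b")` and `clean` is `None`: a leading `None` contributes no elements to the flattened sequence, so it cannot mask a later `Some`.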

andygrove added this to the 0.17.0 milestone May 28, 2026

kazuyukitanimura reviewed May 29, 2026

andygrove wants to merge 1 commit into apache:main from andygrove:worktree-audit-hash-funcs

andygrove May 30, 2026
